Open In Colab

In [ ]:
!pip install numpy scanpy leidenalg scvelo cellrank

Single-cell RNAseq data analysis project (42pts)¶

The ability to profile mRNA molecules in a single cell via single-cell RNA-sequencing (scRNA-seq) has revolutionized biology. First published in 2009, scRNA-seq has been around for the last 15 years in which it has been used to, e.i. characterize how organs develop in human fetuses, how whole organisms develop from a single cell, and how pancreatic beta cells dedifferentiate in type 2 diabetes, amongst many other discoveries. Today single-cell profiling methods have been named Nature Methods method of the year already 3 times in this time (e.i. long-read sequencing, spatial transcriptomics). In this worksheet we will guide you through a simple scRNA-seq analysis pipeline. Each one of the discoveries above used extensions of the basic principles in this worksheet.

In this exercise, we will leverage the python package scanpy. If you are not familiar with this package, please check out its documentation.

The tasks will guide you through Quality Control, Normalization, Feature Selection, Clustering, Annotation and Differential Expression Analysis. More details and background information about the tasks can be found in the sc-best-practices book.

To get started: Copy the raw data set from the lecture gDrive folder to your Google drive.

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scanpy as sc

To load the data we need to mount our google drive

In [4]:
# from google.colab import drive --> commented out because this notebook was modified in pycharm rather than colab
# drive.mount('/content/drive/')

In this exercise, we will use a dataset describing the endocrine development in the pancreas from Bastidas-Ponce et al. (2018).

Adjusting the path: Adjust the path below according to where you saved it on your Google drive.

In [5]:
# Modify path according to your file location --> notebook modified in pycharm, so file was uploaded directly rather than via drive
# adata = sc.read_h5ad('drive/MyDrive/PhD/2023_gobi/endocrinogenesis_day15_gobi.h5ad')
adata = sc.read_h5ad('endocrinogenesis_day15_gobi.h5ad')

With scanpy we load an annotated data matrix. This matrix has the following structure. Throught the exercise all analysis results will be stored in the annotated data matrix. If you want to read more about this data format, check AnnData documentation.

No description has been provided for this image

The main component of an AnnData object is the data matrix X of shape #observations x #variables, so number of cells x number of genes. Gene names and additional information on gene level are stored in var. Cell level, e.g. cell type annotation, is stored in obs. Unstructured annotations of the dataset of any form are stored in uns.

In [6]:
adata
Out[6]:
AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters'

The dataset consists of 3,696 cells across 27,998 genes.

1.1. Preprocessing and Quality Control (6pts)¶

Next we will look into Preprocessing and Quality Control, read along in the Chapter 6 of the sc-best-practices book.

Task: Show the genes that yield the highest fraction of counts in each single cell, across all cells (hint: sc.pl.highest_expr_genes). What type of genes do you mainly see? You can look up gene names on genecards.org

In [7]:
# done: 1.1.1 Add code (1pt - correct function)
sc.pl.highest_expr_genes(adata)
'''
- we mainly see Pyy (codes for neuropeptide y), Iapp (codes for a member of the calcitonin family of peptide hormone), and Gnas (expression pattern)
- Pyy and Iapp are involved in appetite regulation, insulin secretion, and glucose metabolism
- this suggests that the cells are pancreatic or similar
'''
No description has been provided for this image

Data quality control (QC) can be split into cell QC and gene QC. Typical quality measures for assessing the quality of a cell include the number of molecule counts (UMIs), the number of expressed genes, and the fraction of counts that are mitochondrial. A high fraction of mitochondrial reads being picked up can indicate cell stress, as there is a low proportion of nuclear mRNA in the cell. It should be noted that high mitochondrial RNA fractions can also be biological signals indicating elevated respiration.

In [8]:
adata.var['mt'] = adata.var_names.str.startswith('mt-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
In [9]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
No description has been provided for this image
In [10]:
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='pct_counts_mt')
No description has been provided for this image

On this plot, the x-axis indicates how many transcript counts are assigned to a single cell. The y-axis indicates how many genes in an observed cell have at least one transcript count. The dots are the single cells. The color scheme is according to the percentage of mitochondrial transcript counts per individual cell's transcript counts.

1.1.2. ToDo answer question (2pt)

Question:

  • a) which cells might not meet our quality criteria, according to earlier descriptions? Please describe the section on the scatterplot (e.g. bottom right corner and blue dots) where these cells appear
  • b) what is happening to them biologically?

Answer:

  • a) cells which have a low total gene count but a high percentage of mitochondrial transcripts (so yellow-coloured dots that are towards the left / bottom left of the plot)
  • b) they could by dying

1.1.3. Task: (2pts) Plot the distribution of total_counts by using sns.histplot.

  • a) What discrete distribution could describe this data?
  • b) Filter the data for cells that have less than 4% mitochondrial counts. Consider to have a look at adata.obs['pct_counts_mt'] for this
  • c) Filter the data for cells that have at least 1200 detected genes (see sc.pp.filter_cells
In [28]:
# done: Add code
sns.histplot(data = adata.obs['total_counts'])
plt.xlabel("cells")
plt.show()

# a) It looks like a Poisson distribution.
# b)
adata = adata[adata.obs['pct_counts_mt'] < 4].copy()
print(f'Number of cells after filter for cells that have less than 4% mt counts: {adata.n_obs}')
# c)
sc.pp.filter_cells(adata, min_genes=1200)
print(f'Number of cells after filter for minimum 1200 detected genes: {adata.n_obs}')
No description has been provided for this image
Number of cells after filter for cells that have less than 4% mt counts: 3695
Number of cells after filter for minimum 1200 detected genes: 3695

We also filter genes: here, we remove genes are filtered if they are not detected in at least 20 cells.

In [29]:
# Min 20 cells - filters out 0 count genes
sc.pp.filter_genes(adata, min_cells=20)
print(f'Number of genes after gene filter: {adata.n_vars}')
Number of genes after gene filter: 12756

**1.1.4. ** (1pt)

Question: Why are we filtering out genes that are not detected?

Answer: In general, the filtering is done to make sure we have high quality data. Because it is biological data, there can be technical errors that cause 0s in the data, so we filter out those genes to make the data smaller/ easier to handle/ faster to work with.¶

1.2. Normalization (4pts)¶

See Chapter 7 of sc-best-practices book.

Up to this point the data is only available as a count matrix. Counts are representative of molecules that were captured in the scRNA-seq experiment. As not all mRNA molecules in a cell are captured, there is a variability in the total number of counts detected between cells that results from both the number of molecules that were in the cells, and the sampling. As we cannot assume that all cells contain an equal number of molecules (cell sizes can differ substantially), we have to estimate the number of molecules that were initially in the cells. In fact, we don't estimate the exact number of molecules, but instead estimate cell-specific factors that should be proportional to the true number of molecules. These are called size factors. Normalized expression values are calculated by dividing the measured counts by the size factor for the cell, so by $x_{gc} = n_{gc} \cdot \xi_{gc}$ with $x_{gc}$ being the normalized expression values, $n_{gc}$ the UM counts and $\xi_{gc}$ the size factor per gene $g$ in cell $c$.

One basic preprocessing normalization technique assumes that all size factors are equal (library size normalization to counts per million (CPM), so $x_{gc} = n_{gc} \cdot \frac{C}{\sum_g n_{gc}}$ with $C$ a constant at a factor of 10. (see scanpy function sc.pp.normalize_total)

After normalization, data matrices are typically $\log(x+1)$‐transformed (see scanpy function sc.pp.log1p), which is performed to normalize the data distributions. The offset of 1 ensures that zero counts map to zeros.

1.2.1 Task: (2pt)

Perform CPM normalization on your dataset and subsequently $\log(x+1)$‐transform your data (see sc.pp.normalize_total and sc.pp.log1p). Plot the normalized data.

In [30]:
# Prior to normalization, we save the raw count matrix in a new Anndata object
adata_raw = adata.copy()
In [60]:
# done: Add code
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

sc.pl.violin(adata, ['total_counts', 'n_genes_by_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)
WARNING: adata.X seems to be already log-transformed.
No description has been provided for this image

1.2.2. ToDo (1pt)

Question:

CPM Normalization gives you a relative expression value for each gene within a cell. Some genes scale with cell size, while others don't. How would normalization affect these genes differently?

Answer: It corrects differences that are based on library size --> for large cells it can reduce the counts and for small / lowly expressed cells it can scale up.

1.2.3. ToDo (1pt)

Question:

Which important effects has $\log(x+1)$ transformation with respect to the distances between expression changes?

Answer: the variance is stabilised.

1.3 Feature Selection (4pts)¶

See Chapter 8 of the sc-best-practices book

Here we will use a dispersion-based method to extract HVGs. This method first calculates the mean and a dispersion measure (variance/mean) for each gene in the dataset across all cells. The genes are then places into 20 bins based on their average expression. For each bin, we identify those genes whose expression values are highly variable compared to the other genes in the same bin by z-normalizing the dispersion measure (see scanpy function sc.pp.highly_variable_genes). scanpy stores the normalized dispersion values (the final statistics indicating the gene variability) in adata.var["dispersions_norm"]

No description has been provided for this image

1.3.1 ToDo (1pt)

Question Why do we perform feature selection?

Answer: to reduce dimensionality and choose the genes that are truly biologically meaningful.

1.3.2 Task (3pt): Extract the top 2,000 genes of your dataset with a dispersion-based method (see sc.pp.highly_variable_genes). Name the 5 highest HVGs and 5 lowest HVGs.

In [32]:
# done: Add code
sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
In [46]:
hvg = adata.var.sort_values(by='dispersions_norm', ascending=False)
print(f'5 highest: {hvg.head(5).index.tolist()} \n5 lowest: {hvg.tail(5).index.tolist()}')
5 highest: ['Ifitm1', 'Cartpt', 'Npy', 'Cym', 'Gip'] 
5 lowest: ['Nabp1', 'Gsg2', 'Ticrr', '2810429I04Rik', 'Kif18b']

1.4 PCA (5pts)¶

See Chapter 9 in sc-best-practice book.

Next, we run a Principal Component Analysis (PCA) on the dataset. PCA is used to emphasize variation as well as similarity, and to bring out strong patterns in a dataset. Overall, it also serves as a method for dimensionality reduction.

The figure below shows the concepts of PCA in a nutshell.

No description has been provided for this image

1.4.1 Question: (1pt) Why do we do dimensionality reduction?

Answer: many reasons, such as noise reduction, better visualisation, lower computational cost

1.4.2 Task (2pts): Perform a PCA on the dataset and plot the results (see sc.tl.pca)

In [48]:
# done: Add code
sc.tl.pca(adata)
sc.pl.pca_scatter(adata)
No description has been provided for this image

1.4.3 Task (2pts): How many principal components should we consider if we want to keep the minimum number of principal components explaining at least 40% of the variance? (hint: sc.pl.pca_variance_ratio) ? What are the benefits and downsides of using more or less PC's for further analyses?

In [49]:
# done: Add code
sc.pl.pca_variance_ratio(adata, show=True)
print(adata.uns['pca']['variance_ratio'].cumsum())
No description has been provided for this image
[0.2047556  0.2876851  0.31970537 0.3441791  0.362314   0.37570015
 0.38570637 0.39365256 0.3999798  0.40591118 0.41133946 0.4165885
 0.42154944 0.42606288 0.4300546  0.43379    0.43734306 0.44060862
 0.4436711  0.44667003 0.44941673 0.45201772 0.45459452 0.45701474
 0.45930445 0.4615321  0.46369365 0.4657776  0.46783862 0.469869
 0.47188756 0.47383192 0.47573042 0.47762084 0.4794833  0.4813025
 0.48310313 0.4848689  0.4865969  0.4883114  0.49000332 0.49167758
 0.49332383 0.49495414 0.4965686  0.49817663 0.49975514 0.50132823
 0.50289315 0.5044372 ]

Answer: 10 PCs. Using more means getting more information, cons are computational cost as well as less noise reduction (so more noise)

Often, we select as many PCs as needed to explain up to 80% of the variance in the dataset; here we continue with 50 PCs. These have been stored in adata.varmp["PC"] by sc.tl.pca. A representation of our data in this space has been stored in adata.obsm["X_PCA"]. This is automatically chosen for computations by our next algorithm for neighbor graph computation.

1.5 Clustering (9pts)¶

See Chapter 10 in sc-best-practices book.

Clustering single-cell data makes use of a k-nearest neighbourhood graph representation of the cells (see scanpy function sc.pp.neighbors). Typically, we use the leiden algorithm which optimizes graph modularity, a cost function that defines how connected groups of nodes in the knn graph are (sc.tl.leiden). Modularity, $Q$, is defined as:

$Q = \frac{1}{2m}\sum_{i,j}[A_{ij} - \lambda \frac{k_i k_j}{2m}]\delta(\sigma_i, \sigma_j)$.

Here, $m$ is the number of edges in the graph, $A_{ij}$ is the adjacency matrix of the graph, $k_i = \sum_j A_{ij}$ is the degree of node $i$, $\sigma_i$ indicates the community assignment of node $i$, and $\delta(\sigma_i, \sigma_j)$ is a Kronecker-delta function that is 1 if $\sigma_i = \sigma_j$. $\lambda$ is a variable that is assigned by the user, called the resolution parameter.

1.5.1 Question: (1pt) What does a single-cell knn graph represent?

Answer: Each node is a single cell, connected by edges to other cells with the most similar transcripts.

1.5.2 Question: (1pt) How does the resolution parameter control the clustering resolution?

Answer: higher resolution parameter <=> more clusters

1.5.3 Question: (1pt) How does the effect of the resolution parameter differ between sparse and dense regions of the single-cell knn graph?

Answer: sparse regions might be clustered together with a lower parameter or clustered apart with a higher parameter. with denser regions more/ separate clusters are more likely either way.

1.5.4 Task: (3pts) Set up a single-cell knn-graph of the pancreas data and cluster these data as you see fit. Present your clustering results as UMAP plots coloured by cluster assignment in the final report. (see e.g. sc.tl.umap, and sc.pl.umap)

In [50]:
# For some reason, there is a numpy error while running `sc.pp.neighbors`
!pip install -U numpy
Requirement already satisfied: numpy in /home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages (2.3.4)

[notice] A new release of pip is available: 24.3.1 -> 25.3
[notice] To update, run: pip install --upgrade pip
In [ ]:
# done: Add code
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
sc.tl.umap(adata)
In [52]:
sc.pl.umap(adata, color="leiden") # separate cell, got some warnings with cell above
No description has been provided for this image

1.5.5 Question: (3pts)

Try a few different clustering resolutions. In some cases you will see that a cluster can be spread out across multiple places on the UMAP plot, with cells from other clusters in between. Why might this be the case? (Hint: you can explore the clustering with different resolution parameters using sc.tl.leiden and its resolution parameter)

In [54]:
# done: add code
sc.tl.leiden(adata, key_added='leiden res 0.2', resolution=0.2)
sc.tl.leiden(adata, key_added='leiden res 0.4', resolution=0.4)
sc.tl.leiden(adata, key_added='leiden res 0.6', resolution=0.6)
sc.tl.leiden(adata, key_added='leiden res 0.8', resolution=0.8)

sc.pl.umap(
    adata,
    color=['leiden res 0.2', 'leiden res 0.4', 'leiden res 0.6', 'leiden res 0.8', 'leiden'],
    legend_loc='on data',
)
No description has been provided for this image

Solution We see the example of a spread out cluster at resolution 0.25 and 0.5 (green cluster 2 in both cases). It can happen for 2 reasons:

  1. Cells of this cluster are somehow similar, but shape of the data can't be projected at 2D plane without rupturing this cluster;
  2. There are subpopulations in this cluster, but low resolution of clustering doesn't allow to split them into separate clusters. This is supported by the fact that at resolution 1 there are several clusters for these regions (11, 6, and 5).

1.6 Marker genes & Cell type annotation (9pts)¶

To interpret the clusters, we need to identify the cell types that these clusters represent. This is typically done by finding features (genes) that characterize a cluster from the data, and then comparing these features against other published studies. Someone has usually described the cell type you are looking at before. Here, we are facilitating this process by already giving you a list of cell-type markers that you can use to identify the cells.

Data-driven markers can be detected by statistical tests that assess whether a gene is more highly expressed in one cluster vs the rest. Typically we use simple tests such as a t-test or a Wilcoxon rank sum test here. These tests are implemented in scanpy under sc.tl.rank_genes_groups. We can assess whether the gene markers we find are meaningful by comparing their expression across clusters for example in a dotplot (sc.pl.dotplot) or by plotting their expression in a UMAP visualization (sc.tl.umap and sc.pl.umap).

In [55]:
ct_markers = {
    'Beta': ['Ins1', 'Ins2'],
    'Alpha': ['Gcg', 'Arx'],
    'Epsilon': ['Ghrl', 'Irs4'],
    'Delta': ['Sst', 'Hhex'],
    'Ductal': ['Sox9', 'Anxa2'],
    'Endocrine progenitor' :['Bicc1', 'Fev'],
    'Ngn3 high EP': ['Neurog3', 'Ppp1r14a'],
    'Ngn3 low EP': ['Actg1', 'Spp1']
}
In [56]:
import itertools
all_markers = list(itertools.chain(ct_markers.values()))
flat_markers = list(itertools.chain(*all_markers)) + ['leiden']
sc.pl.umap(adata, color=flat_markers)
No description has been provided for this image
In [57]:
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
No description has been provided for this image
In [58]:
sc.pl.dotplot(adata, ct_markers, groupby="leiden")
No description has been provided for this image

1.6.1 Task: (3pts)

Identify the cell types in this pancreas dataset based on the marker genes given in the dictionary ct_markers above. Present these results as a UMAP (sc.tl.umap and sc.pl.umap) coloured by cell type annotation. Note, you may have to go back to the subclustering step to improve your clustering (hint: setting the resolution parameter to 0.5 or 1 works fairly well for this dataset.)

In [62]:
# done: the 0, 1, 2, ... are the leiden cluster labels for the individual cells in adata.obs["leiden"]
# Populate this dictionary for the leiden clusters you have (likely around 15), with the cell type from ct_markers. Multiple clusters could actually have the same cell type labels!
annotation = {
    "0": "Ngn3 low EP",
    "1": "Endocrine progenitor",
    "2": "Ngn3 high EP",
    "3": "Beta",
    "4": "Ngn3 low EP",
    "5": "Alpha",
    "6": "Alpha",
    "7": "Ngn3 high EP",
    "8": "Ngn3 low EP",
    "9": "Beta",
    "10": "Ngn3 low EP",
    "11": "Epsilon",
    "12": "Ngn3 low EP",
    "13": "Delta",
    "14": "Ngn3 high EP",
    "15": "Ngn3 high EP",
}
In [63]:
adata.obs["manual_celltype_annotation"] = adata.obs["leiden"].map(annotation)
In [64]:
sc.pl.umap(adata, color=["manual_celltype_annotation", "leiden"], ncols=2)
No description has been provided for this image

1.6.2 Task: (1pt)

Find additional marker genes for each of the cell types beyond the markers that are provided. Can you find markers that are cell-type specific and some that are present across cell types?

In [ ]:
# Add code

1.6.3 Task: (2pt)

Can you find any other cellular substructures in the data that are not described well by only the cell type markers given? Try to cluster these and give marker genes for the clusters.

Note: check out the restrict_to parameter in the sc.tl.leiden function to perform a subclustering.

In [67]:
sc.tl.leiden(adata, restrict_to=('manual_celltype_annotation', ['Endocrine progenitor']), key_added='leiden_subtypes')
adata.obs['leiden_subtypes'].value_counts()
Out[67]:
leiden_subtypes
Ngn3 low EP               1367
Ngn3 high EP               663
Beta                       511
Alpha                      496
Epsilon                    175
Delta                      127
Endocrine progenitor,0      80
Endocrine progenitor,1      67
Endocrine progenitor,2      61
Endocrine progenitor,3      60
Endocrine progenitor,4      52
Endocrine progenitor,5      36
Name: count, dtype: int64
In [68]:
sc.pl.umap(adata, color='leiden_subtypes')
No description has been provided for this image
In [70]:
sc.tl.rank_genes_groups(adata, 'leiden_subtypes', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
No description has been provided for this image

1.6.4 Task: (2pts)

Remove Alpha cells from the dataset. And redo the marker gene detection for beta cells. Why do the marker genes for beta cells change?

In [72]:
# done: add code
sc.tl.rank_genes_groups(adata, 'manual_celltype_annotation', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
No description has been provided for this image
In [73]:
adata_subset = adata[adata.obs["manual_celltype_annotation"] != "Alpha"].copy()
print(adata_subset)
AnnData object with n_obs × n_vars = 3199 × 12756
    obs: 'clusters', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_counts', 'n_genes', 'leiden', 'leiden res 0.2', 'leiden res 0.4', 'leiden res 0.6', 'leiden res 0.8', 'manual_celltype_annotation', 'leiden_EP_subtypes', 'leiden_subtypes'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'leiden', 'umap', 'leiden_colors', 'leiden res 0.2', 'leiden res 0.4', 'leiden res 0.6', 'leiden res 0.8', 'leiden res 0.2_colors', 'leiden res 0.4_colors', 'leiden res 0.6_colors', 'leiden res 0.8_colors', 'rank_genes_groups', 'manual_celltype_annotation_colors', 'leiden_EP_subtypes', 'leiden_subtypes', 'leiden_subtypes_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
In [74]:
sc.tl.rank_genes_groups(adata_subset, 'manual_celltype_annotation', method='t-test')
sc.pl.rank_genes_groups(adata_subset, n_genes=25, sharey=False)
No description has been provided for this image

Answer: the markers are defined based on the other cells in the dataset.

1.6.5 Question: (1pt)

What issues can this change in marker genes cause for defining cell type markers for particular cell types? Discuss how this issue might be avoided.

Answer: different datasets can have different marker genes, even for the same cell type

1.7 Attempting annotation with default leiden clustering (3pts)¶

How many populations do you see and would be able to call from this following UMAP?

In [75]:
sc.pl.umap(adata, color='leiden')
No description has been provided for this image

Obviously the answer would be >10, however, this is a pre-defined clustering algorithm result, with a default parameter and a deterministic output. Let's inspect a bit more by visualizing the gene features that are most associated with the different subgroups (i.e. rank_genes_groups).

In [76]:
sc.pl.rank_genes_groups_dotplot(adata, groupby="leiden", figsize=[18, 4], cmap='RdBu_r', n_genes=5, standard_scale='var')
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: Alpha, Beta, Delta, etc.
No description has been provided for this image

Here pay attention that, based on hierarchical clustering (shown as tree on the right of this dotplot) one could label 3 major groups as well.

1.7.1 Task: (1pt) Adjust the cluster numbers below (in this text cell and the code) so they match your three major groups:

In [77]:
# done: extend the dictionaries starting with 3, 8, and 15 with leiden cluster numbers. Maybe you need to remove these numbers 3, 8, 15
adata.obs['cell_type_guess'] = np.where(adata.obs['leiden'].astype(int).isin({3, 5, 7, 4, 2, 11, 13}), 'A', 'N/A')
adata.obs['cell_type_guess'] = np.where(adata.obs['leiden'].astype(int).isin({12, 9, 10, 0}), 'B', adata.obs['cell_type_guess'])
adata.obs['cell_type_guess'] = np.where(adata.obs['leiden'].astype(int).isin({6, 15, 1, 8, 14}), 'C', adata.obs['cell_type_guess'])
In [78]:
sc.pl.umap(adata, color='cell_type_guess')
No description has been provided for this image

This visualization would be very similar to the one obtained when executing the leiden clustering with a lower resolution parameter.

1.7.2 Task: (1pt)

Compare this visualization with a coarser leiden clustering.

In [79]:
sc.pl.umap(adata, color=["leiden res 0.2", "cell_type_guess"])
No description has been provided for this image

1.7.3 Task: (1pt)

The red/blue dotplot above might have repeated genes. This is cause by the top-N genes reported for one group of cells partly overlapping with the top-N genes of another one. i) do you detect repeated genes the dotplot (there is at least one).

Solution Yes, for example Aplp1

1.8 Score cell cycle genes (1pt)¶

In general terms, cell cycle processes are a part of scRNA-seq datasets, and its has to be taken into account, scored and labeled.

For this, as a general processing step, cell cycle genes known to vary across different cell cycle stages, are scored per cell. Then, a very simple gating algorithm is used to define whether some cells are more likely S in phase, G2M, or G1

In [80]:
!wget https://raw.githubusercontent.com/theislab/scanpy_usage/master/180209_cell_cycle/data/regev_lab_cell_cycle_genes.txt
--2025-11-17 15:21:10--  https://raw.githubusercontent.com/theislab/scanpy_usage/master/180209_cell_cycle/data/regev_lab_cell_cycle_genes.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 556 [text/plain]
Saving to: ‘regev_lab_cell_cycle_genes.txt’

regev_lab_cell_cycl 100%[===================>]     556  --.-KB/s    in 0s      

2025-11-17 15:21:10 (25.8 MB/s) - ‘regev_lab_cell_cycle_genes.txt’ saved [556/556]

In [81]:
cell_cycle_genes = [x.strip() for x in open('regev_lab_cell_cycle_genes.txt')]
cell_cycle_genes = [x.capitalize() for x in cell_cycle_genes]
In [82]:
s_genes = cell_cycle_genes[:43]
g2m_genes = cell_cycle_genes[43:]
cell_cycle_genes = [x for x in cell_cycle_genes if x in adata.var_names]
In [ ]:
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
In [84]:
# generate a coarser leiden annotation
sc.tl.leiden(adata, resolution=.25, key_added='leiden.coarse')
In [85]:
sc.pl.umap(adata, color=['leiden.coarse', 'S_score', 'G2M_score', 'phase'], ncols=2)
No description has been provided for this image

Using the cc_score, one neat observation is that some of the coarse leiden clusters are associated with cycling cells.

1.8.1 Task (1pt)

Identify the coarse clusters that refer to cycling cells and re-label them. To do so, modify the XX / 'XX' accordingly.

In [86]:
# Modify XX according to clusters corresponding to cycling cells
adata.obs['cell_type_guess'] = np.where(adata.obs['leiden.coarse'].astype(int).isin({0}), 'cycling cells', adata.obs['cell_type_guess'])
In [87]:
sc.pl.umap(adata, color='cell_type_guess')
No description has been provided for this image

Combining both hierarchical clustering, leiden and cycling cell scores into one label could be useful to keep making decisions

In [88]:
adata.obs['cell_type_guess'] = adata.obs['cell_type_guess'].astype(str) + '.' + adata.obs['leiden.coarse'].astype(str)
adata.obs['cell_type_guess'] = np.where(adata.obs['cell_type_guess'].str.startswith('cycling'), 'cycling cells', adata.obs['cell_type_guess'])
In [89]:
sc.pl.umap(adata, color='cell_type_guess')
No description has been provided for this image

Let's see rank genes groups one more time

In [90]:
sc.tl.rank_genes_groups(adata, 'cell_type_guess')
In [91]:
sc.pl.rank_genes_groups_dotplot(adata, figsize=[10, 2], cmap='RdBu_r', n_genes=5, standard_scale='var')
WARNING: dendrogram data not found (using key=dendrogram_cell_type_guess). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
No description has been provided for this image

Now we have a decently interpretable matrix, but this is still not finished. How can we possibly obtain a name for these cells?

1.9 Guessing cell clusters through an iterative process (1pt)¶

Before having a follow-up meeting with your collaborators, it would be reasonable to inspect a bit on the Biology of your system by reading up in databases where knowledge about your cells and the markers genes you're observing have been described before.

  • e.g.1. Based on GeneCards info, Fev is as an transcription factor of endocrine cells. This statement alone could be useful to recommend a cluster YYYY as an endocrine cell population.
  • e.g.2. Sox4 interacts with Neurog3 during Beta cell development Probably a cluster ZZZZ elevated in both should be updated to a Ngn3+, transitioning cluster?
In [96]:
### done: replace YYYY with a cluster that has elevated Fev
adata.obs['cell_type_guess'] = adata.obs['cell_type_guess'].str.replace('cycling cells', 'Fev2')
### done: replace ZZZZ with a cluster that has elevated Sox4 and Neurog3
adata.obs['cell_type_guess'] = adata.obs['cell_type_guess'].str.replace('C.2', 'Ngn3+')
In [97]:
sc.pl.umap(adata, color='cell_type_guess')
No description has been provided for this image

1.9.1 Task: (1pt)

Try to provide labels for at least one the of the remaining, unlabeled groups.

Tip: Endocrine-progenitor cells differentiate into specific population of pancreatic cells, among others, Alpha/Beta/Gamma/Delta/Epsilon (greek letters). See Figure 3-4 of this paper for some additional gene markers

In [ ]:
# Add code

Before next section

Before starting the DE analysis, we will update the labels of our current object with the ones that have been already previously processed by the analysis of this dataset. Those will be overwritten in our current object.

Reference for annotation

In [98]:
adata.obs['cell_type_ann'] = adata.obs['clusters']
# for learning purposes, please do not uncomment and print this visualization step until you're advanced in the cell annotation tasks above.
sc.pl.umap(adata, color='cell_type_ann')
No description has been provided for this image

2. DE analysis (5pts)¶

Differential expression analysis is a group of statistical tests that are used to establish whether there a exists a significant variation across a set of tested conditions for each gene. In its easiset form, this test can test for the difference between two distinct groups: This scenario can be handled with for example (Welch's) T-test, rank sum tests or Wald and likelihood ratio tests (LRT). Wald tests and LRT allow for more adaptive assumptions on the noise model and can therefore be more statistically correct. Moreover, they also allow the testing of more complex effect, e.g. for the variation across many groups (a single p-value for: Is there any difference between four conditions?) or across continuous covariates (a single covariate for: Is a gene expression trajectory in time non-constant?).

2.1 Task (2pts)

Using Scanpy's 'rank_genes_groups' function calculate a t-test and a Wilcoxon rank sum test between Alpha and Beta cells. You can use your own annotation if it fits or the 'cell_type_ann' annotation.

In [100]:
# Add code
sc.tl.rank_genes_groups(adata, groupby='manual_celltype_annotation', method='t-test')
sc.tl.rank_genes_groups(adata, groupby='manual_celltype_annotation', method='wilcoxon')

sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
No description has been provided for this image

Question

What is the difference in assumptions between these tests?

Answer The t-test assumes equal variance and that gene expression is roughly normally distributed within each group The Wilcoxon test does not assume normal distribution and is based on ranks rather than raw values

2.2 Task (2pts)

Now plot the top 30 differentially expressed genes for both of the tests. Do they overlap? What are the top 3 differentially expressed genes?

In [101]:
# Add code
# t test
sc.tl.rank_genes_groups(adata, groupby='manual_celltype_annotation', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=30, sharey=False)

# wilcoxon
sc.tl.rank_genes_groups(adata, groupby='manual_celltype_annotation', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=30, sharey=False)
No description has been provided for this image
No description has been provided for this image

2.3 Question (1pt)

We calculated the differentially expressed genes using the normalized counts. Should differentially expressed genes be calculated using the unnormalized or normalized data? Explain your reasoning.

=> it should be calculated using normalised data because normalisation removes bias, stabilises the variance, and allows for better comparison of data between cells, as the raw counts could vary a lot there.

3. Trajectory Inference (5pts)¶

Trajectory inference approaches analyze genome-wide omics data from thousands of single cells and computationally infer the order of these cells along developmental trajectories.

Many methods for the inference of trajectories exist based on statistical approaches, RNA velocity or optimal transport (see advanced topics below). Here, we introduce diffusion pseudotime, an efficient way of robustly estimating the ordering according to diffusion pseudotime, which measures the transitions between cells using diffusion-like random walks.

Convert UMAP indices to arrays.

In [102]:
umap_0 = [term[0] for term in adata.obsm['X_umap']]
umap_1 = [term[1] for term in adata.obsm['X_umap']]

Set root cell to the cell with the smallest value in the first UMAP component and compute DPT.

3.1 Question (1pt)

Why are we selecting the cell with the smallest value in the first UMAP component as the root cell?

Answer: because it represents a good starting point / initial state

In [103]:
adata.uns['iroot'] = np.flatnonzero(umap_0== max(umap_0))[0]
sc.tl.dpt(adata = adata)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
In [104]:
sc.pl.umap(adata, color=['dpt_pseudotime'])
No description has been provided for this image

3.2 Task (2pts)

Plot the diffusion pseudotime and the cell types based on the component combinations 1,2 and 2,3 in a UMAP.

In [105]:
sc.pl.diffmap(adata, color="manual_celltype_annotation", components=["1,2", "2,3"])
No description has been provided for this image

Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions. PAGA maps preserve the global topology of data, allow analyzing data at different resolutions, and result in much higher computational efficiency of the typical exploratory data analysis workflow. In simpler words, PAGA provides an estimation of the connectivity of clusters while maintaining the cluster distribution in space.

3.3 Task (2pts)

Calculate and plot a PAGA graph with Scanpy using the 'cell_type' annotation. Which cells are connected? What does the thickness of the arrows mean? What does the result tell you about the biology? Attempt an interpretation.

Answer thicker arrows = higher confidence in prediction / strong biological link => Ngn3 high has more connections and is central, so this could indicate that it can transpose into different endocrine outcomes => Alpha / Beta / Delta / Epsilon clusters have fewer connections, so this could mean they will not differentiate much anymore

In [106]:
sc.tl.paga(adata, groups="manual_celltype_annotation")
In [107]:
sc.pl.paga(adata)
No description has been provided for this image

4. Advanced Topics (Optional)¶

The advanced topics are not part of the core exercise sheet and are entirely optional. However, they do award bonus points which can offset point losses in the earlier exercises. By design they do not necessarily introduce all concepts in depth. However, references to core papers and documentation are provided to encourage exploration.

4.1 RNA Velocity (8pts)¶

Here you will learn the basics of RNA velocity analysis. Check chapter 14 of the best practices book as well.

RNA abundance is a powerful indicator of the state of individual cells. Single-cell RNA sequencing can reveal RNA abundance with high quantitative accuracy, sensitivity and throughput. However, this approach captures only a static snapshot at a point in time, posing a challenge for the analysis of time-resolved phenomena such as embryogenesis or tissue regeneration. Here we show that RNA velocity—the time derivative of the gene expression state—can be directly estimated by distinguishing between unspliced and spliced mRNAs in common single-cell RNA sequencing protocols. RNA velocity is a high-dimensional vector that predicts the future state of individual cells on a timescale of hours.

For illustration, it is applied to endocrine development in the pancreas, with lineage commitment to four major fates: α, β, δ and ε-cells.
See here for more details.

To analyze the RNA velocity of our Pancreas dataset we will use scVelo.

In [ ]:
!pip install git+https://github.com/theislab/scvelo
In [ ]:
!pip install -U numpy
In [110]:
import numpy
numpy.__version__
Out[110]:
'2.3.4'
In [111]:
import scvelo as scv
In [112]:
scv.settings.verbosity = 3
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization

The analysis is based on the in-built pancreas data as used above. Let's get a clean and already annotated copy.

In [113]:
adata = scv.datasets.pancreas()
adata
100%|██████████| 50.0M/50.0M [00:01<00:00, 32.0MB/s]
Out[113]:
AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
    var: 'highly_variable_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'distances', 'connectivities'

The proportion of unsplied and spliced counts is the basis of RNA velocity. Let's have a look:

In [114]:
scv.pl.proportions(adata)
No description has been provided for this image

Here, the proportions of spliced/unspliced counts are displayed. Depending on the protocol used (Drop-Seq, Smart-Seq), we typically have between 10%-25% of unspliced molecules containing intronic sequences. We also advice the examination of the variations on cluster level to verify consistency in splicing efficiency. Here, we find variations as expected, with slightly lower unspliced proportions at cycling ductal cells, then higher proportion at cell fate commitment in Ngn3-high and Pre-endocrine cells where many genes start to be transcribed.

Preprocessing requisites consist of gene selection by detection (with a minimum number of counts) and high variability (dispersion), normalizing every cell by its total size and logarithmizing X. Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched.

All of this is summarized in a single function scv.pp.filter_and_normalize, which essentially runs the following:

scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)

Further, we need the first and second order moments (means and uncentered variances) computed among nearest neighbors in PCA space, summarized in scv.pp.moments, which internally computes scv.pp.pca and scv.pp.neighbors. First order is needed for deterministic velocity estimation, while stochastic estimation also requires second order moments.

In [115]:
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
Filtered out 20801 genes that are detected 20 counts (shared).
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[115], line 1
----> 1 scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
      2 scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

File ~/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/scvelo/preprocessing/utils.py:416, in filter_and_normalize(data, min_counts, min_counts_u, min_cells, min_cells_u, min_shared_counts, min_shared_cells, retain_genes, layers_normalize, copy, **kwargs)
    414 if layers_normalize is not None and "enforce" not in kwargs:
    415     kwargs["enforce"] = True
--> 416 normalize_per_cell(adata, layers=layers_normalize, **kwargs)
    418 return adata if copy else None

TypeError: normalize_per_cell() got an unexpected keyword argument 'n_top_genes'

Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic'). For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

The solution to the full dynamical model is obtained by setting mode='dynamical', which requires to run scv.tl.recover_dynamics(adata) beforehand. We skip the dynamic model in this exercise sheet.

4.1.1 Task (1pt)

Compute the velocities with scVelo. Consider to use scv.tl.velocity.

In [116]:
scv.tl.velocity(adata, mode='stochastic')
Normalized count data: X, spliced, unspliced.
computing neighbors
/home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/scvelo/tools/velocity.py:326: DeprecationWarning: Automatic neighbor calculation is deprecated since scvelo==0.4.0 and will be removed in a future version of scVelo. Please compute neighbors first with Scanpy.
  moments(adata)
/home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/scvelo/preprocessing/moments.py:71: DeprecationWarning: `neighbors` is deprecated since scvelo==0.4.0 and will be removed in a future version of scVelo. Please compute neighbors with Scanpy.
  neighbors(
    finished (0:00:00) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:06) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing velocities
/home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/scvelo/tools/optimization.py:184: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  gamma[i] = np.linalg.pinv(A.T.dot(A)).dot(A.T.dot(y[:, i]))
    finished (0:00:10) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)

The computed velocities are stored in adata.layers just like the count matrices.

The combination of velocities across genes can then be used to estimate the future state of an individual cell. In order to project the velocities into a lower-dimensional embedding, transition probabilities of cell-to-cell transitions are estimated. That is, for each velocity vector we find the likely cell transitions that are accordance with that direction. The transition probabilities are computed using cosine correlation between the potential cell-to-cell transitions and the velocity vector, and are stored in a matrix denoted as velocity graph. The resulting velocity graph has dimension $n_{obs} \times n_{obs}$ and summarizes the possible cell state changes that are well explained through the velocity vectors (for runtime speedup it can also be computed on reduced PCA space by setting approx=True).

4.1.2 Task (1pt)

Compute the velocity graph with scVelo. Consider to use scv.tl.velocity_graph

In [117]:
scv.tl.velocity_graph(adata, n_jobs=1)
computing velocity graph (using 1/8 cores)
WARNING: Unable to create progress bar. Consider installing `tqdm` as `pip install tqdm` and `ipywidgets` as `pip install ipywidgets`,
or disable the progress bar using `show_progress_bar=False`.
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=52657) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
    finished (0:00:29) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

For a variety of applications, the velocity graph can be converted to a transition matrix by applying a Gaussian kernel to transform the cosine correlations into actual transition probabilities. You can access the Markov transition matrix via scv.utils.get_transition_matrix.

As mentioned, it is internally used to project the velocities into a low-dimensional embedding by applying the mean transition with respect to the transition probabilities, obtained with scv.tl.velocity_embedding. Further, we can trace cells along the Markov chain to their origins and potential fates, thereby getting root cells and end points within a trajectory, obtained via scv.tl.terminal_states.

Finally, the velocities are projected onto any embedding, specified by basis, and visualized in one of these ways:

  • on cellular level with scv.pl.velocity_embedding,
  • as gridlines with scv.pl.velocity_embedding_grid,
  • or as streamlines with scv.pl.velocity_embedding_stream.

Note, that the data has an already pre-computed UMAP embedding, and annotated clusters. When applying to your own data, these can be obtained with scv.tl.umap and scv.tl.louvain. For more details, see the scanpy tutorial. Further, all plotting functions are defaulted to using basis='umap' and color='clusters', which you can set accordingly.

4.1.3 Task (1pt)

Plot the velocity embedding as a stream and examine the result. Which proposed transitions can you see? Is the result biologically sound? Consider to use scv.pl.velocity_embedding_stream.

In [118]:
scv.pl.velocity_embedding_stream(adata, basis="umap", color="clusters")
computing velocity embedding
    finished (0:00:01) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
No description has been provided for this image

The most fine-grained resolution of the velocity vector field we get at single-cell level, with each arrow showing the direction and speed of movement of an individual cell.

4.1.4 Task (2pts)

Plot the velocity embedding and examine the Ngn3-cells, the α-cells and transient β-cells. What do you see? Consider to use scv.pl.velocity_embedding.

In [119]:
# Plot velocity embedding
scv.pl.velocity_embedding(adata, basis="umap", color="clusters")
No description has been provided for this image

This is perhaps the most important part as we advise not to limit biological conclusions to the projected velocities, but to examine individual gene dynamics via phase portraits to understand how inferred directions are supported by particular genes.

See the gif here to get an idea of how to interpret a spliced vs. unspliced phase portrait. Gene activity is orchestrated by transcriptional regulation. Transcriptional induction for a particular gene results in an increase of (newly transcribed) precursor unspliced mRNAs while, conversely, repression or absence of transcription results in a decrease of unspliced mRNAs. Spliced mRNAs is produced from unspliced mRNA and follows the same trend with a time lag. Time is a hidden/latent variable. Thus, the dynamics needs to be inferred from what is actually measured: spliced and unspliced mRNAs as displayed in the phase portrait.

Now, let us examine the phase portraits of some marker genes, visualized with
scv.pl.velocity(adata, gene_names) or scv.pl.scatter(adata, gene_names).

4.1.5 Task (1pt)

Plot the phase portraits for the genes 'Cpe', 'Gnao1', 'Ins2' and 'Adk' with scVelo. Consider to use scv.pl.scatter.

In [120]:
# Plot phase portraits
scv.pl.scatter(adata, basis=['Cpe', 'Gnao1', 'Ins2', 'Adk'], color="clusters", frameon=False)
No description has been provided for this image

The black line corresponds to the estimated 'steady-state' ratio, i.e. the ratio of unspliced to spliced mRNA abundance which is in a constant transcriptional state. RNA velocity for a particular gene is determined as the residual, i.e. how much an observation deviates from that steady-state line. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

For instance Cpe explains the directionality in the up-regulated Ngn3 (yellow) to Pre-endocrine (orange) to β-cells (green), while Adk explains the directionality in the down-regulated Ductal (dark green) to Ngn3 (yellow) to the remaining endocrine cells.

4.1.5 Task (2pts)

Plot a scatterplot of the gene 'Cpe' colored by 'clusters' and 'velocity'. Add an outline for 'Ngn3 high EP, Pre-endocrine, Beta'. What do you see? Consider to use scv.pl.scatter.

In [121]:
# Plot scatterplot
scv.pl.scatter(adata, basis=["Cpe"], color=["clusters", "velocity"], frameon=False)
No description has been provided for this image

4.2 Cell Fate Mapping (10pts)¶

Fate mapping describes the studying of embryonic origin of adult tissue, cell types and structures. Here, the "fate" of every cell is mapped against the embryo highlighting which part of the embryo will develop into which tissue. This process is also commonly known as cell lineage tracing at the single-cell level where it is also commonly applied to trace tumor development.

One of several tools to determine cell fates is CellRank.

This tutorial shows how to work with CellRank using the low-level mode. We will interact directly with CellRank's two main modules, kernels and estimators. For more info on CellRank, see the documentation or read the preprint.

The data we use here comes from Bastidas-Ponce et al. (2018). The AnnData object we download has been preprocessed already and velocities have been computed using scVelo's dynamical model, see the scVelo tutorial on this for more details.

In [122]:
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np

scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo')
cr.settings.verbosity = 2
/home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/cellrank/_utils/_lineage.py:168: DeprecationWarning: numpy.core.overrides is deprecated and has been renamed to numpy._core.overrides. The numpy._core namespace contains private NumPy internals and its use is discouraged, as NumPy internals can change without warning in any release. In practice, most real-world usage of numpy.core is to access functionality in the public NumPy API. If that is the case, use the public NumPy API. If not, you are using NumPy internals. If you would still like to access an internal attribute, use numpy._core.overrides.ARRAY_FUNCTIONS.
  from numpy.core.overrides import ARRAY_FUNCTIONS
/home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/cellrank/pl/_heatmap.py:11: DeprecationWarning: Please import `convolve` from the `scipy.ndimage` namespace; the `scipy.ndimage.filters` namespace is deprecated and will be removed in SciPy 2.0.0.
  from scipy.ndimage.filters import convolve

First, we need to get the data. The following commands will download the adata object and save it under datasets/endocrinogenesis_day15.5_preprocessed.h5ad.

In [123]:
adata = cr.datasets.pancreas(kind="preprocessed")
scv.utils.show_proportions(adata)
adata
try downloading from url
https://figshare.com/ndownloader/files/25030028
... this may take a while but only happens once
100%|██████████| 140M/140M [00:04<00:00, 30.9MB/s] 
Abundance of ['unspliced', 'spliced']: [0.19 0.81]
Out[123]:
AnnData object with n_obs × n_vars = 2531 × 2000
    obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'dpt_pseudotime'
    var: 'highly_variable_genes', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'fit_r2', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'velocity_genes'
    uns: 'clusters_colors', 'clusters_fine_colors', 'diffmap_evals', 'iroot', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_diffmap', 'X_pca', 'X_umap', 'velocity_umap'
    varm: 'PCs', 'loss'
    layers: 'Ms', 'Mu', 'fit_t', 'fit_tau', 'fit_tau_', 'spliced', 'unspliced', 'velocity', 'velocity_u'
    obsp: 'connectivities', 'distances'

As stated above, this data has been preprocessed already. We can inspect scVelo's computed velocities:

In [124]:
scv.pl.velocity_embedding_stream(adata)
No description has been provided for this image

CellRank is a package for analyzing directed single cell data, whereby we mean single cell data that can be respresented via a directed graph. Most prominently, this is the case for single cell data where velocities have been computed - we can use these to direct the KNN graph. However, there are other situations in which we can inform the KNN graph of the direciton of the process, using i.e. pseudotime (see Palantir) or information obtained via mRNA labeling with e.g. scSLAM-seq, scEU-seq or sci-fate. Because we wanted CellRank to be widely applicable, no matter how directionality was introduced to the data, we split it up into two main modules, kernels and estimators. In short, kernels allow you to compute a (directed) transition matrix, whereas estimators allow you to analyze it.

To construct a transition matrix, CellRank offers a number of kernel classes in cellrank.tl.kernels. Currently implemented are the following:

  • VelocityKernel: compute transition matrix based on RNA velocity.
  • ConnectivityKernel: compute symmetric transition matrix based on transcriptomic similarity (essentially a DPT kernel).
  • PalantirKernel: mimics Palantir. Uses pseudotime to direct the KNN graph.

These kernels can be combined by simply using the + or * operator, we will demonstrate this below. To find out more, check out the API. Note that the kernel classes are designed to be easy to extend to incoporate future kernels based on e.g. mRNA labeling or other sources of directionality. Let's start with the VelocityKernel:

In [126]:
!pip install "numpy<2.0.0"
/usr/lib/python3.12/pty.py:95: DeprecationWarning: This process (pid=52657) is multi-threaded, use of forkpty() may lead to deadlocks in the child.
  pid, fd = os.forkpty()
Requirement already satisfied: numpy<2.0.0 in /home/emma/PycharmProjects/gobi/.venv/lib/python3.12/site-packages (1.26.4)

[notice] A new release of pip is available: 24.3.1 -> 25.3
[notice] To update, run: pip install --upgrade pip
In [127]:
from cellrank.kernels import VelocityKernel
vk = VelocityKernel(adata)
# error for some reason
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[127], line 2
      1 from cellrank.kernels import VelocityKernel
----> 2 vk = VelocityKernel(adata)

File ~/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/cellrank/kernels/_velocity_kernel.py:64, in VelocityKernel.__init__(self, adata, backward, attr, xkey, vkey, **kwargs)
     55 def __init__(
     56     self,
     57     adata: AnnData,
   (...)     62     **kwargs: Any,
     63 ):
---> 64     super().__init__(
     65         adata,
     66         backward=backward,
     67         xkey=xkey,
     68         vkey=vkey,
     69         attr=attr,
     70         **kwargs,
     71     )
     72     self._logits: Optional[np.ndarray] = None

File ~/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/cellrank/kernels/mixins/_kernel.py:71, in BidirectionalMixin.__init__(self, backward, *args, **kwargs)
     70 def __init__(self, *args: Any, backward: bool = False, **kwargs: Any):
---> 71     super().__init__(*args, **kwargs)
     72     if not isinstance(backward, (bool, np.bool_)):
     73         raise TypeError(f"Expected `backward` to be `bool`, found `{type(backward).__name__}`.")

File ~/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/cellrank/kernels/_base_kernel.py:608, in Kernel.__init__(self, adata, parent, **kwargs)
    606 self._adata = adata
    607 self._n_obs = adata.n_obs
--> 608 self._read_from_adata(**kwargs)

File ~/PycharmProjects/gobi/.venv/lib/python3.12/site-packages/cellrank/kernels/_velocity_kernel.py:94, in VelocityKernel._read_from_adata(self, xkey, vkey, attr, gene_subset, **kwargs)
     92 self._xdata = self._extract_data(key=xkey, attr=attr, subset=gene_subset)
     93 self._vdata = self._extract_data(key=vkey, attr=attr, subset=gene_subset)
---> 94 np.testing.assert_array_equal(
     95     x=self._xdata.shape,
     96     y=self._vdata.shape,
     97     err_msg=f"Shape mismatch: {self._xdata.shape} vs {self._vdata.shape}",
     98 )
    100 nans = np.isnan(np.sum(self._vdata, axis=0))
    101 if np.any(nans):

TypeError: assert_array_equal() got an unexpected keyword argument 'x'

To learn more about this object, we can print it:

In [ ]:
print(vk)

There is not very: much there yet. We can change this by computing the transition matrix.

4.2.1 Task: (2pts)

Compute the transition matrix and examine the velocity kernel again. Consider to use vk.compute_transition_matrix.

In [ ]:
vk.compute_transition_matrix()
# cannot run this though
In [ ]:
vk

There's a lot more info now! Explain in a couple of words each what the mode, backward and softmax_scale properties describe.

4.2.2 Task (1pt)

Import the connectivity kernel form cellrank.kernels and use it to compute a transition matrix. Name its output ck.

In [ ]:
from cellrank.kernels import ConnectivityKernel

# ck = ...

4.2.3 Task (1pt)

Define a variable combined_kernel which combines both kernels, but weights the velocity kernel by 80% and the connectivity kernel by 20%.

In [ ]:
#combined_kernel = 0.8 * ... + 0.2 * ...
#combined_kernel
Out[ ]:
(0.8 * VelocityKernel[n=2531, model='deterministic', similarity='correlation', softmax_scale=3.974] + 0.2 * ConnectivityKernel[n=2531, dnorm=True, key='connectivities'])

Estimators take a kernel object and offer methods to analyze it. The main objective is to decompose the state space into a set of metastable states (also called macrostates) that represent the slow-time scale dynamics of the process. A subset of these metastable states will be the initial or terminal states of the process, the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in cellrank.tl.estimators:

  • CFLARE: Clustering and Filtering Left And Right Eigenvectors. Heuristic method based on the spectrum of the transition matrix.
  • GPCCA: Generalized Perron Cluster Cluster Analysis: project the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, see Reuter et al. (2018).

For more information on the estimators, have a look at the API. We will demonstrate the GPCCA estimator here, however, the CFLARE estimator has a similar set of methods (which do different things internally). Let's start by initializing a GPCCA object based on the combined_kernel we constructed above, i.e. the one that uses both RNA velocities and connectivities.

In [ ]:
from cellrank.estimators import GPCCA
g = GPCCA(combined_kernel)
print(g)
GPCCA[kernel=(0.8 * VelocityKernel[n=2531] + 0.2 * ConnectivityKernel[n=2531]), initial_states=None, terminal_states=None]

In addition to the information about the kernel it is based on, this prints out the number of states in the underlying Markov chain. GPCCA needs a real sorted Schur decomposition to work with.

4.2.4 Task (1pt)

Compute a Schur decomposition of the transition matrix using the g object from above and plot the eigenvalues in complex plane. The CellRank documentation will help.

In [ ]:
# TODO: add code

To compute the Schur decomposition, there are two methods implemented

  • method='brandts': use scipy.linalg.schur to compute a full real Schur decomposition and sort it using a python implementation of Brandts (2002). Note that scipy.linalg.schur only supports dense matrices, so consider using this for small cell numbers (<10k).
  • method='krylov': use an interative, krylov-subspace based algorightm provided in SLEPc to directly compute a partial, sorted, real Schur decomposition. This works with sparse matrices and will scale to extremly large cell numbers.

The real Schur decomposition for transition matrix T is given by Q U Q**(-1), where Q is orthogonal and U is quasi-upper triangular, which means it's upper triangular except for 2x2 blocks on the diagonal. 1x1 blocks on the diagonal represent real eigenvalues, 2x2 blocks on the diagonal represent complex eigenvalues. Above, we plotted the top 20 eigenvalues of the matrix T to see whether there is an apparent eigengap. In the present case, there seems to be such a gap after the first 3 eigenvalues. We can visualize the corresponding Schur vectors in the embedding:

In [ ]:
g.plot_spectrum()

These vectors will span an invariant subspace, let's call it X (Schur vectors in the columns). The next step in GPCCA is to find a linear combination of these vectors such that the Markov chain defined on the subset of states has large selt-transition probability.

4.2.5 Task (2pts)

Compute and plot the macrostates using 3 states. Afterwards plot the individual states. Finally, plot only the most likely cells for each state (discrete).

In [ ]:
# Compute and plot the macrostates here.

Finally, save your results in the AnnData object using the appropriate CellRank function.

Once we found the terminal states in this dataset, the next big questions is: how likely is each cell to develop towards each of these? In CellRank, we compute this via so-called "absorption probabilities".

4.2.6 Task (2pts)

Compute absorption probabilities using the g object and visualise them in the embedding, so that we have one plot per lineage

In [ ]:
# Calculate absorption probabilites here.
In [ ]:
g.plot_fate_probabilities(same_plot=False)

Once we have set the absorption_probabilities, we can correlate them against all genes to find potential lineage drivers. Below, we show how to do this for just one state. Note that results are written to the .var attribute of adata:

In [ ]:
alpha_drivers = g.compute_lineage_drivers(lineages='Alpha', return_drivers=True)
alpha_drivers.sort_values(by="Alpha_corr", ascending=False)

We can look at some of the identified genes:

In [ ]:
g.plot_lineage_drivers("Alpha", n_genes=5)

To find genes which may be involved in fate choice early on, we could restrict the correlation to a subset of our clusters using the cluster_key and clusters parameters.

4.2.7 Task (1pt)

To bring it all together visualize the fate probabilites towards all terminal states jointly in a single plot with a circular embedding using CellRank. Consider to use cr.pl.circular_projection.

In [ ]:
# Plot circular embedding
In [ ]: